Automatic Construction of Clean Broad-Coverage Translation Lexicons
Abstract
Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations: pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding [...], as well as dictionary-sized translation lexicons that are over [...] correct.
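To make the notion of an indirect association concrete, here is a minimal sketch in Python. It is not the paper's method: the toy bitext, the Dice co-occurrence score, the 0.5 threshold, and the greedy one-to-one linking filter are all assumptions chosen for illustration. It shows how a simple co-occurrence statistic admits unrelated pairs such as ("house", "la"), and how a single filtering pass can remove them while keeping the true translations.

```python
from collections import Counter
from itertools import product

# Toy English-French bitext, invented for illustration only.
BITEXT = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]


def dice_scores(bitext):
    """Score every co-occurring word pair with the Dice coefficient."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (e, f): 2 * n / (src_count[e] + tgt_count[f])
        for (e, f), n in pair_count.items()
    }


def greedy_link_filter(bitext, scores):
    """Keep only pairs that win a greedy one-to-one linking in some sentence pair.

    Assumed stand-in for a cleaning step: within each sentence pair, the
    best-scoring pairs are linked first and each word may be linked once,
    so weaker (indirect) pairs are blocked by stronger (direct) ones.
    """
    kept = set()
    for src, tgt in bitext:
        candidates = sorted(
            product(set(src.split()), set(tgt.split())),
            key=lambda pair: (-scores[pair], pair),
        )
        used_src, used_tgt = set(), set()
        for e, f in candidates:
            if e not in used_src and f not in used_tgt:
                kept.add((e, f))
                used_src.add(e)
                used_tgt.add(f)
    return kept


scores = dice_scores(BITEXT)
# Raw lexicon: every pair whose score reaches the (assumed) threshold.
raw_lexicon = {pair for pair, s in scores.items() if s >= 0.5}
# Cleaned lexicon: raw entries that also survive the linking filter.
clean_lexicon = greedy_link_filter(BITEXT, scores) & raw_lexicon
```

On this toy data the raw lexicon contains indirect pairs such as ("house", "la"), which co-occur only because "the" and "house" appear together; the filter drops them because "la" is claimed by the stronger pair ("the", "la"). A real system would iterate such a step and use a more robust association statistic.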
Similar resources
Automatic construction of translation lexicons
The paper describes a statistical approach to automatic extraction of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall coverage and processing time. The comparison with other works shows that our method i...
Automatic Methods to Supplement Broad-Coverage Subcategorization Lexicons
The paper describes a system for extracting subcategorization frames of verbs not found in existing broad-coverage valency lexicons. The system uses two parameters: the results of a finite-state parser and the predictions of a set of automatically learned rules which transfer subcategorization frames from cognate verbs. An in-depth evaluation quantified the contribution of the individual parame...
Automatic Construction of Chinese-English Translation Lexicons
The process of constructing translation lexicons from parallel texts (bitexts) can be broken down into three stages: mapping bitext correspondence, counting co-occurrences, and estimating a translation model. State-of-the-art techniques for accomplishing each stage of the process had already been developed, but only for bitexts involving fairly similar languages. Correct and efficient implementa...
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons...
A Scalable Architecture for Bilingual Lexicography
SABLE is a Scalable Architecture for Bilingual LExicography. It is designed to produce clean broad-coverage translation lexicons from raw, unaligned parallel texts. Its black-box functionality makes it suitable for naive users. The architecture has been implemented for different language pairs, and has been tested on very large and noisy input. SABLE does not rely on language-specific resources...
Journal title:
- CoRR
Volume: cmp-lg/9607037
Publication date: 1996